Skip to content

fix: lead responsiveness (error suppression, circuit breaker, routing, auth)#2328

Merged
btucker merged 12 commits intomainfrom
blindspots/dogfood/20260404-073834
Apr 4, 2026
Merged

fix: lead responsiveness (error suppression, circuit breaker, routing, auth)#2328
btucker merged 12 commits intomainfrom
blindspots/dogfood/20260404-073834

Conversation

@btucker
Copy link
Copy Markdown
Owner

@btucker btucker commented Apr 4, 2026

Summary

Six fixes for channel lead responsiveness, found via dogfood testing:

1. Suppress error output from auto-posting to channels

  • Session error text (e.g. expired OAuth 401) was auto-posted as normal output

2. Add circuit breaker for lead spawn failures

  • Leads had no max retry limit — crash-looped forever
  • MAX_LEAD_RESTARTS=3, escalates to ops

3. Topic channel @ALL now finds workers in dm channels

  • Workers in dm-{name} were invisible to channel-scoped @all

4+5. Nudge resume/respawn respects spawn failure cooldowns

  • Both ResumeAndDeliver and RespawnAndDeliver now check cooldowns

6. Replace broken OAuth code flow with direct claude auth login

  • Old flow (BROWSER=false + paste code) didn't work
  • New flow: Login button runs claude auth login directly, opens native browser
  • Removed: /api/auth/login/code, AuthLoginProcess, pending_auth_login

Spec updates: v2-spec.md §1.4, §4.1, §4.4

Test plan

  • flush_auto_output_suppresses_error_results / _posts_normal_results
  • lead_gives_up_after_max_spawn_failures
  • at_all_in_topic_finds_workers_in_dm_channels
  • stopped_lead_with_active_cooldown_resolves_to_drop
  • stopped_worker_with_active_cooldown_resolves_to_drop
  • stopped_lead_with_session_id_and_active_cooldown_resolves_to_drop
  • Full test suite: 1534 lib tests, 0 failures
  • cargo clippy + cargo fmt clean

🤖 Generated with Claude Code

When a Claude session exits with an error (e.g. expired OAuth token),
the error text was being auto-posted to the channel as if it were
normal assistant output. This caused raw JSON API errors to appear
in user-facing channels.

Now, when flush_auto_output detects a Result event with is_error: true,
it clears the pending events instead of posting them. Errors are still
logged to daemon.log for debugging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 4, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.93%. Comparing base (bc6b661) to head (a7d5ede).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/daemon_v2/web/routes.rs 0.00% 17 Missing ⚠️
src/daemon_v2/executor/mod.rs 84.37% 5 Missing ⚠️
src/daemon_v2/daemon.rs 55.55% 4 Missing ⚠️
src/daemon_v2/decisions/health.rs 88.23% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2328      +/-   ##
==========================================
- Coverage   64.00%   63.93%   -0.07%     
==========================================
  Files         100      100              
  Lines       38588    38578      -10     
==========================================
- Hits        24699    24666      -33     
- Misses      13889    13912      +23     
Files with missing lines Coverage Δ
src/daemon_v2/decisions/chat.rs 96.56% <100.00%> (+0.13%) ⬆️
src/daemon_v2/web/mod.rs 100.00% <ø> (ø)
src/daemon_v2/decisions/health.rs 88.78% <88.23%> (+0.48%) ⬆️
src/daemon_v2/daemon.rs 66.53% <55.55%> (-4.22%) ⬇️
src/daemon_v2/executor/mod.rs 62.92% <84.37%> (-9.21%) ⬇️
src/daemon_v2/web/routes.rs 3.17% <0.00%> (+0.20%) ⬆️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7292e8777a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/daemon_v2/executor/mod.rs Outdated
Comment on lines +688 to +692
let has_error_result = events.iter().any(|e| {
matches!(
e,
crate::headless::StreamEvent::Result { is_error: true, .. }
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Defer error suppression until turn end

flush_auto_output suppresses only when the current events batch already contains StreamEvent::Result { is_error: true }, but drain_session_output also flushes on the 2-second timer before the result event arrives. In error cases where an assistant error message is emitted first and the timer fires before the Result, this branch is bypassed and the raw error text is posted; the later error result then clears an empty buffer. That means the new “suppress error output” behavior is still violated under this timing.

Useful? React with 👍 / 👎.

Channel leads had no max retry limit — when a lead's auth expired, it
would crash, wait 120s, crash again, and repeat forever. Workers already
had MAX_WORKER_RESTARTS=3 with ops escalation, but leads were missing
this protection.

Now ensure_lead_for_channel checks failure_count and stops retrying
after MAX_LEAD_RESTARTS=3 consecutive failures, posting to ops instead.
The daemon also uses the correct per-kind max and escalation key format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@btucker btucker changed the title fix: suppress error output from being auto-posted to channels fix: suppress error auto-output + add lead spawn circuit breaker Apr 4, 2026
Workers are spawned with channel=dm-{name}, not the task's channel.
This meant @ALL in a topic channel only found agents indexed under
that channel (the lead), missing workers whose tasks belong there.

The fix adds a second scan after the by_channel lookup that finds
workers by checking task.channel instead of agent.channel. The nudge
dedup set prevents double-nudging workers that happen to be indexed
in the correct channel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@btucker btucker changed the title fix: suppress error auto-output + add lead spawn circuit breaker fix: error suppression, lead circuit breaker, and topic @all routing Apr 4, 2026
resolve_nudge_action returned RespawnAndDeliver for any stopped agent
with no session_id, regardless of whether a SpawnFailure cooldown was
active. This meant every user message to a dead lead triggered a new
spawn attempt, bypassing both the 120s cooldown and the circuit
breaker (MAX_LEAD_RESTARTS) from ensure_channel_leads_alive.

Now resolve_nudge_action checks is_active(SpawnFailure, key) before
returning RespawnAndDeliver. Leads use channel as key, workers use
task_id.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@btucker btucker changed the title fix: error suppression, lead circuit breaker, and topic @all routing fix: error suppression, lead circuit breaker, @all routing, nudge cooldowns Apr 4, 2026
btucker and others added 2 commits April 4, 2026 08:24
The Cycle 4 fix only added cooldown checks to RespawnAndDeliver (no
session_id). ResumeAndDeliver (has session_id) was unchecked, meaning
agents with preserved session_ids (e.g. auth errors that don't trigger
"No conversation found") could be resumed on every user message.

Moved the cooldown check before the session_id branch so it applies
to both resume and respawn paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The web UI's Re-authenticate button triggered an OAuth code flow
(BROWSER=false, capture URL, paste code) that didn't work in practice.
The only reliable auth method is running `claude auth login` directly,
which opens the browser natively for OAuth.

Changes:
- Backend: simplified auth_login route to run `claude auth login`
  without BROWSER=false, waits up to 5 min for completion, then
  restarts all agents
- Removed: /api/auth/login/code endpoint, AuthLoginProcess struct,
  pending_auth_login state
- UI: replaced code-paste flow with simple Login button that shows
  "Logging in" state while waiting for browser auth to complete

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@btucker btucker changed the title fix: error suppression, lead circuit breaker, @all routing, nudge cooldowns fix: lead responsiveness (error suppression, circuit breaker, routing, auth) Apr 4, 2026
Addresses Codex review: flush_auto_output's error check only worked
when the Result event was in the same batch as the error text. If the
2-second timer fired between the error text and the Result, the text
leaked through.

Moved error suppression from flush_auto_output to the drain loop:
- Set session_errored flag when Result { is_error: true } is received
- Skip timer flushes, stream-end flushes, and turn-end flushes when
  the flag is set
- flush_auto_output is now a pure extract-and-post function

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@btucker
Copy link
Copy Markdown
Owner Author

btucker commented Apr 4, 2026

Addressed Codex review in bf69bb3:

P1 — Defer error suppression until turn end: Moved the error check from flush_auto_output to a session_errored flag in the drain loop. The flag is set when Result { is_error: true } arrives, and all subsequent flushes (timer, stream-end, turn-end) are suppressed. This closes the race where the 2-second timer could fire between the error text and the Result event.

btucker and others added 5 commits April 4, 2026 09:40
Pre-existing biome error exposed by touching Channel.svelte.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pre-existing biome noNonNullAssertion error in store.test.ts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Biome noNonNullAssertion errors — replace ! with optional chaining
or nullish coalescing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Downgrade noNonNullAssertion to warn for .svelte files (reactive
  code uses $store! assertions extensively and they are valid at
  runtime)
- Apply biome auto-fix for Channel.svelte optional chaining
- Apply biome format fix for api.ts startAuthLogin signature

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AccountPanel.svelte also had the old startAuthLogin + submitAuthCode
flow. Updated to use the simplified direct-login approach (matching
Channel.svelte changes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@btucker btucker merged commit b47707c into main Apr 4, 2026
13 checks passed
@btucker btucker deleted the blindspots/dogfood/20260404-073834 branch April 4, 2026 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant